library(stopwords)
library(tidyverse)
library(tidytext)
library(textdata)
ETC1010/5510 Tutorial 9 Solution
Introduction to Data Analysis
🎯 Workshop Objectives
- Represent pieces of text as a tidy data object
- Remove uninteresting words from a body of text
- Investigate the most frequently used words in a body of text
- Perform sentiment analysis to find negative or positive words
🔧 Instructions
- In each question, you will replace '___' with your answer. Please note that the Rmd will not knit until you've answered all of the questions.
- Once you have filled in all the blanks, remember to go to knitr::opts_chunk at the top of the document, change eval = TRUE, then knit the document.
- Exercise 9D is optional, and you can work on it at your own pace.
Install the necessary packages.
🥡 Exercise 9A: Tidy text
Tokenising text
text <- c("This will be an uncertain time for us my love",
"I can hear the echo of your voice in my head",
"Singing my love",
"I can see your face there in my hands my love",
"I have been blessed by your grace and care my love",
"Singing my love")
text
[1] "This will be an uncertain time for us my love"
[2] "I can hear the echo of your voice in my head"
[3] "Singing my love"
[4] "I can see your face there in my hands my love"
[5] "I have been blessed by your grace and care my love"
[6] "Singing my love"
text_df <- tibble(line = seq_along(text), text = text)
text_df
# A tibble: 6 × 2
line text
<int> <chr>
1 1 This will be an uncertain time for us my love
2 2 I can hear the echo of your voice in my head
3 3 Singing my love
4 4 I can see your face there in my hands my love
5 5 I have been blessed by your grace and care my love
6 6 Singing my love
unnest_tokens() takes a character vector and unnests it into a tidy data frame.
What’s going on in these examples?
text_df %>%
unnest_tokens(
output = word,
input = text,
token = "words" # default option
  )
# A tibble: 49 × 2
line word
<int> <chr>
1 1 this
2 1 will
3 1 be
4 1 an
5 1 uncertain
6 1 time
7 1 for
8 1 us
9 1 my
10 1 love
# ℹ 39 more rows
text_df %>%
unnest_tokens(
output = word,
input = text,
token = "characters"
  )
# A tibble: 171 × 2
line word
<int> <chr>
1 1 t
2 1 h
3 1 i
4 1 s
5 1 w
6 1 i
7 1 l
8 1 l
9 1 b
10 1 e
# ℹ 161 more rows
Look at the help documentation for the unnest_tokens() function and read the options that you can use for the ‘token’ argument.
?unnest_tokens
text_df %>%
unnest_tokens(
output = word,
input = text,
token = "ngrams",
n = 2
  )
# A tibble: 43 × 2
line word
<int> <chr>
1 1 this will
2 1 will be
3 1 be an
4 1 an uncertain
5 1 uncertain time
6 1 time for
7 1 for us
8 1 us my
9 1 my love
10 2 i can
# ℹ 33 more rows
text_df %>%
unnest_tokens(
output = word,
input = text,
token = "ngrams",
n = 3
  )
# A tibble: 37 × 2
line word
<int> <chr>
1 1 this will be
2 1 will be an
3 1 be an uncertain
4 1 an uncertain time
5 1 uncertain time for
6 1 time for us
7 1 for us my
8 1 us my love
9 2 i can hear
10 2 can hear the
# ℹ 27 more rows
Use unnest_tokens() to help you answer the following questions from the two paragraphs of text below:
dickens <- "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way - in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only."
burns <- c("This is a thousand monkeys working at a thousand typewriters. Soon, they'll have finished the greatest novel known to man.
'All right, let's see... It was the best of times, it was the BLURST of times?' You stupid monkey.")
quotes_df <- tibble(from = c("Dickens", "Simpsons"),
                    text = c(dickens, burns))
1. How many words are in each quote?
quotes_df %>%
unnest_tokens(output = word,
input = text) %>%
  count(from)
# A tibble: 2 × 2
from n
<chr> <int>
1 Dickens 119
2 Simpsons 39
2. How many times does the trigram “it was the” occur?
quotes_df %>%
unnest_tokens(output = trigram,
input = text,
token = "ngrams",
n = 3) %>%
filter(trigram == "it was the") %>%
  count(from)
# A tibble: 2 × 2
from n
<chr> <int>
1 Dickens 10
2 Simpsons 2
Stop Words
- In computing, stop words are words which are filtered out before or after processing of natural language data (text).
- They usually refer to the most common words in a language, but there is not a single list of stop words used by all natural language processing tools.
Let’s look at the list of stop words from the tidytext package.
stopwords_english <- get_stopwords()
stopwords_english
# A tibble: 175 × 2
word lexicon
<chr> <chr>
1 i snowball
2 me snowball
3 my snowball
4 myself snowball
5 we snowball
6 our snowball
7 ours snowball
8 ourselves snowball
9 you snowball
10 your snowball
# ℹ 165 more rows
Here is an alternative dictionary of stop words from a different source, the "smart" lexicon.
stopwords_smart <- get_stopwords(source = "smart")
stopwords_smart
# A tibble: 571 × 2
word lexicon
<chr> <chr>
1 a smart
2 a's smart
3 able smart
4 about smart
5 above smart
6 according smart
7 accordingly smart
8 across smart
9 actually smart
10 after smart
# ℹ 561 more rows
In the sentence “This will be an uncertain time for us my love”, how many of these words are not stopwords?
Step 1: Break up individual words
uncertain <- text_df %>%
filter(line == 1) %>%
unnest_tokens(word, text)
uncertain
# A tibble: 10 × 2
line word
<int> <chr>
1 1 this
2 1 will
3 1 be
4 1 an
5 1 uncertain
6 1 time
7 1 for
8 1 us
9 1 my
10 1 love
Step 2: Remove the stop words with an anti-join from dplyr
uncertain %>%
  anti_join(stopwords_english)
# A tibble: 4 × 2
line word
<int> <chr>
1 1 uncertain
2 1 time
3 1 us
4 1 love
If you haven’t used anti_join() before, have a look at the help documentation to see what it does.
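As a quick illustration (a toy example made up here, not part of the tutorial data), anti_join(x, y) returns the rows of x that have no match in y:

```r
library(tidyverse)

x <- tibble(word = c("uncertain", "time", "my", "love"))
y <- tibble(word = c("my", "for", "us"))

# anti_join() keeps the rows of x whose "word" does not appear in y
anti_join(x, y, by = "word")
# rows kept: "uncertain", "time", "love" ("my" is dropped)
```

This is exactly how the stop word removal above works: the tokenised words are x, and the stop word list is y.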
?anti_join
Using the quotes data frame we defined earlier, answer the following questions:
- How many words are there in each quote after removing stop words?
quotes_no_stopwords <- quotes_df %>%
unnest_tokens(output = word,
input = text) %>%
anti_join(stopwords_smart)
count(quotes_no_stopwords, from)
# A tibble: 2 × 2
from n
<chr> <int>
1 Dickens 35
2 Simpsons 13
- What is the most frequent word in each of the quotes after removing stop words?
quotes_no_stopwords %>%
  count(from, word, sort = TRUE)
# A tibble: 40 × 3
from word n
<chr> <chr> <int>
1 Dickens age 2
2 Dickens direct 2
3 Dickens epoch 2
4 Dickens period 2
5 Dickens season 2
6 Dickens times 2
7 Simpsons thousand 2
8 Simpsons times 2
9 Dickens authorities 1
10 Dickens belief 1
# ℹ 30 more rows
- What is the most frequent word across both quotes after removing stop words?
quotes_no_stopwords %>%
  count(word, sort = TRUE)
# A tibble: 39 × 2
word n
<chr> <int>
1 times 4
2 age 2
3 direct 2
4 epoch 2
5 period 2
6 season 2
7 thousand 2
8 authorities 1
9 belief 1
10 blurst 1
# ℹ 29 more rows
Sentiment
One way to analyse the sentiment of a text is to treat the text as the combination of its individual words, and the sentiment of the whole text as the sum of the sentiment of its individual words.
A sentiment lexicon is essentially a dictionary in which words are categorised as positive or negative, or scored on a numeric scale.
# This function is for bypassing the interactive menu when knitting the document,
# you don't need it if you are running the code chunk by chunk interactively.
get_sentiments <- function(dict_name) {
if (!file.exists(paste0(dict_name, ".rds"))) {
textdata:::download_functions[[dict_name]](tempdir())
textdata:::process_functions[[dict_name]](tempdir(), paste0(dict_name, ".rds"))
}
readRDS(paste0(dict_name, ".rds"))
}
# If you're asked if you want to download the database, please select yes, option 1 then enter.
afinn <- get_sentiments("afinn") # numeric
afinn %>% filter(value == 5) # examples of very positive words (have a go at changing the 5 to other numbers, negative or positive, and see what you get)
# A tibble: 5 × 2
word value
<chr> <dbl>
1 breathtaking 5
2 hurrah 5
3 outstanding 5
4 superb 5
5 thrilled 5
bing <- get_sentiments("bing") # categorical
bing
# A tibble: 6,789 × 2
word sentiment
<chr> <chr>
1 2-faced negative
2 2-faces negative
3 abnormal negative
4 abolish negative
5 abominable negative
6 abominably negative
7 abominate negative
8 abomination negative
9 abort negative
10 aborted negative
# ℹ 6,779 more rows
After tokenising into words, use a left or inner join to attach a sentiment to each word.
With a left join, a word gets a missing value whenever it doesn't match the dictionary; stop words, for example, have no sentiment associated with them.
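The choice of join matters. In this toy sketch (made up here to show the contrast), left_join() keeps every word and fills value with NA for non-matches, while inner_join() drops the non-matching words entirely:

```r
library(tidyverse)

words   <- tibble(word = c("my", "uncertain", "love"))
lexicon <- tibble(word = c("uncertain", "love"), value = c(-1, 3))

left_join(words, lexicon, by = "word")   # 3 rows; "my" has value NA
inner_join(words, lexicon, by = "word")  # 2 rows; "my" is dropped
```

With numeric lexicons like "afinn", remember to use na.rm = TRUE when averaging after a left join.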
Here, let us analyze the sentiment of the first line of text_df: “This will be an uncertain time for us my love”.
text_df %>%
filter(line == 1) %>%
unnest_tokens(word, text) %>%
  left_join(afinn)
# A tibble: 10 × 3
line word value
<int> <chr> <dbl>
1 1 this NA
2 1 will NA
3 1 be NA
4 1 an NA
5 1 uncertain -1
6 1 time NA
7 1 for NA
8 1 us NA
9 1 my NA
10 1 love 3
text_df %>%
filter(line == 1) %>%
unnest_tokens(word, text) %>%
  left_join(bing)
# A tibble: 10 × 3
line word sentiment
<int> <chr> <chr>
1 1 this <NA>
2 1 will <NA>
3 1 be <NA>
4 1 an <NA>
5 1 uncertain negative
6 1 time <NA>
7 1 for <NA>
8 1 us <NA>
9 1 my <NA>
10 1 love positive
Using the quotes we looked at above (quotes_no_stopwords), use the “afinn” lexicon to compute the average sentiment of each quote. Which one is considered more positive?
quotes_no_stopwords %>%
left_join(afinn) %>%
# now for each quote we want to summarise the average value
group_by(from) %>%
  summarise(mean = mean(value, na.rm = TRUE))
# A tibble: 2 × 2
from mean
<chr> <dbl>
1 Dickens -0.429
2 Simpsons 0.5
Analysing reviews of a video game
This is a continuation of the example we looked at in the lecture:
User and critic reviews for the game Animal Crossing scraped from Metacritic.
This data comes from a #TidyTuesday challenge.
We can read the data into R directly using the following URLs:
(Note this requires an internet connection to work)
critics <- readr::read_tsv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/critic.tsv')
user_reviews <- readr::read_tsv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/user_reviews.tsv')
Go through the process of using the critics data to look at the following:
critic_words <- critics %>%
unnest_tokens(output = word, input = text)
critic_words
# A tibble: 5,741 × 4
grade publication date word
<dbl> <chr> <date> <chr>
1 100 Pocket Gamer UK 2020-03-16 animal
2 100 Pocket Gamer UK 2020-03-16 crossing
3 100 Pocket Gamer UK 2020-03-16 new
4 100 Pocket Gamer UK 2020-03-16 horizons
5 100 Pocket Gamer UK 2020-03-16 much
6 100 Pocket Gamer UK 2020-03-16 like
7 100 Pocket Gamer UK 2020-03-16 its
8 100 Pocket Gamer UK 2020-03-16 predecessors
9 100 Pocket Gamer UK 2020-03-16 operates
10 100 Pocket Gamer UK 2020-03-16 outside
# ℹ 5,731 more rows
1. What are the most used words over the collection of reviews?
critic_words %>%
  count(word, sort = TRUE)
# A tibble: 1,393 × 2
word n
<chr> <int>
1 the 280
2 and 193
3 to 166
4 a 165
5 new 135
6 of 135
7 is 113
8 animal 111
9 crossing 109
10 horizons 94
# ℹ 1,383 more rows
2. What are the most used words, after removing stop words?
critics_words_no_stop <- critic_words %>%
anti_join(stopwords_smart)
count(critics_words_no_stop, word, sort = TRUE)
# A tibble: 1,113 × 2
word n
<chr> <int>
1 animal 111
2 crossing 109
3 horizons 94
4 game 61
5 series 35
6 island 31
7 time 26
8 nintendo 22
9 life 19
10 experience 18
# ℹ 1,103 more rows
3. Plot the distribution of word frequencies over the collection of reviews
critic_words %>%
count(word, sort = TRUE) %>%
ggplot(aes(x = n)) +
  geom_histogram()
From the plot above you can see that most words are only used once. This is pretty common.
4. What is the longest review and what is the shortest review?
critic_words %>%
count(publication, sort = TRUE) %>%
  filter(n == max(n) | n == min(n))
# A tibble: 2 × 2
publication n
<chr> <int>
1 Nintendo Life 140
2 Gamer.no 2
5. Read the text of the shortest review
critics %>%
filter(publication == "Gamer.no") %>%
  pull(text)
[1] "Quotation forthcoming."
6. Using “afinn”, add sentiment values to each word in a review
critic_words <- critic_words %>%
  left_join(afinn)
7. For each publication, compute the average sentiment for the review
sentiment_avg <- critic_words %>%
group_by(publication) %>%
  summarise(mean_sentiment = mean(value, na.rm = TRUE), # average afinn value per word
n_missing_words = sum(is.na(value)),
            n_words = n())
8. Are longer reviews more positive?
sentiment_avg %>%
ggplot(aes(x = n_words, y = mean_sentiment)) +
  geom_point()
9. Do the grades correlate with the review score? Are there any reviews with negative sentiments but high scores?
critics %>%
left_join(sentiment_avg, by = "publication") %>%
ggplot(aes(x = grade, y = mean_sentiment)) +
  geom_point()
📚 Exercise 9B: Austen Book
The books of Jane Austen
In this lab exercise, we will analyse the sentiment of Austen’s books. Below is the code to tokenise the books and add line numbers and chapters.
library(janeaustenr)
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(
str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))
) %>%
ungroup() %>%
  unnest_tokens(word, text)
1. Add sentiment categories to all the books using the "nrc" lexicon.
nrc <- get_sentiments("nrc")
2. What are the most common "anger" words used in Emma?
Emma_nrc <- tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc, by = "word")
Emma_nrc %>%
filter(sentiment == "anger") %>%
  count(word, sort = TRUE)
# A tibble: 313 × 2
word n
<chr> <int>
1 ill 72
2 bad 60
3 feeling 56
4 bear 52
5 words 49
6 obliging 34
7 evil 33
8 difficulty 30
9 spite 24
10 loss 23
# ℹ 303 more rows
3. What are the most common “surprise” words used in Emma?
Emma_nrc %>%
filter(sentiment == "surprise") %>%
  count(word, sort = TRUE)
# A tibble: 165 × 2
word n
<chr> <int>
1 good 359
2 hope 143
3 deal 92
4 present 89
5 spirits 64
6 marry 63
7 leave 58
8 feeling 56
9 smile 44
10 pleasant 41
# ℹ 155 more rows
Using another lexicon (“bing”, or “afinn”), compute the proportion of positive words in each of Austen’s books.
4. Which book is the most positive and which is the most negative?
wordcounts <- tidy_books %>%
group_by(book) %>%
summarise(total_book_words = n())
tidy_books %>%
inner_join(get_sentiments("bing"), by = "word") %>%
group_by(book, sentiment) %>%
summarize(sentiment_word_count = n()) %>%
left_join(wordcounts, by = "book") %>%
mutate(ratio = sentiment_word_count / total_book_words) %>%
group_by(sentiment) %>%
  top_n(1, ratio)
# A tibble: 2 × 5
# Groups: sentiment [2]
book sentiment sentiment_word_count total_book_words ratio
<fct> <chr> <int> <int> <dbl>
1 Emma positive 7157 160996 0.0445
2 Northanger Abbey negative 2518 77780 0.0324
📚 Exercise 9C: The Simpsons
The Simpsons data set is available as below.
scripts <- read_csv("data/simpsons_script_lines.csv")
chs <- read_csv("data/simpsons_characters.csv")
sc <- left_join(scripts, chs, by = c("character_id" = "id"))
sc
# A tibble: 158,264 × 16
id episode_id number raw_text timestamp_in_ms speaking_line character_id
<dbl> <dbl> <dbl> <chr> <dbl> <lgl> <dbl>
1 9549 32 209 Miss Hoov… 848000 TRUE 464
2 9550 32 210 Lisa Simp… 856000 TRUE 9
3 9551 32 211 Miss Hoov… 856000 TRUE 464
4 9552 32 212 Lisa Simp… 864000 TRUE 9
5 9553 32 213 Edna Krab… 864000 TRUE 40
6 9554 32 214 Martin Pr… 877000 TRUE 38
7 9555 32 215 Edna Krab… 881000 TRUE 40
8 9556 32 216 Bart Simp… 882000 TRUE 8
9 9557 32 217 (Apartmen… 889000 FALSE NA
10 9558 32 218 Lisa Simp… 889000 TRUE 9
# ℹ 158,254 more rows
# ℹ 9 more variables: location_id <dbl>, raw_character_text <chr>,
# raw_location_text <chr>, spoken_words <chr>, normalized_text <chr>,
# word_count <chr>, name <chr>, normalized_name <chr>, gender <chr>
Section A:
1. Count the number of times a character speaks
sc %>%
  count(name, sort = TRUE)
# A tibble: 6,722 × 2
name n
<chr> <int>
1 Homer Simpson 30104
2 <NA> 17522
3 Marge Simpson 14265
4 Bart Simpson 13967
5 Lisa Simpson 11641
6 C. Montgomery Burns 3207
7 Moe Szyslak 2863
8 Seymour Skinner 2443
9 Ned Flanders 2145
10 Grampa Simpson 1957
# ℹ 6,712 more rows
2. Are there missing names?
Yes - these rows are not speaking lines (for example, scene descriptions).
sc %>%
  filter(is.na(name))
# A tibble: 17,522 × 16
id episode_id number raw_text timestamp_in_ms speaking_line character_id
<dbl> <dbl> <dbl> <chr> <dbl> <lgl> <dbl>
1 9557 32 217 (Apartmen… 889000 FALSE NA
2 9565 32 225 (Springfi… 918000 FALSE NA
3 75766 263 106 (Moe's Ta… 497000 FALSE NA
4 9583 32 243 (Train St… 960000 FALSE NA
5 9604 32 264 (Simpson … 1070000 FALSE NA
6 9655 33 0 (Simpson … 84000 FALSE NA
7 9685 33 30 (Simpson … 177000 FALSE NA
8 9686 33 31 (Simpson … 177000 FALSE NA
9 9727 33 72 (Simpson … 349000 FALSE NA
10 9729 33 74 (Simpson … 355000 FALSE NA
# ℹ 17,512 more rows
# ℹ 9 more variables: location_id <dbl>, raw_character_text <chr>,
# raw_location_text <chr>, spoken_words <chr>, normalized_text <chr>,
# word_count <chr>, name <chr>, normalized_name <chr>, gender <chr>
3. Pre-process the text by tokenizing the words and removing the stopwords.
# Step 1. Unnest tokens for spoken words
# Step 2. Remove stop words
sc_long <- sc %>%
filter(speaking_line) %>%
unnest_tokens(output = word, input = spoken_words) %>%
  anti_join(get_stopwords())
4. Count the words
sc_words <- sc_long %>%
  count(word, sort = TRUE)
5. Plot a graph of the top 20 spoken words
sc_words %>%
top_n(20, wt = n) %>%
ggplot(aes(x = fct_reorder(word, n),
y = n)) +
geom_col() +
labs(x = '',
y = 'count',
title = 'Top 20 words') +
  coord_flip()
6. Tag the words with sentiments. First, count words spoken by each character.
sc_word_by_character <- sc_long %>%
count(name, word)
head(sc_word_by_character)
# A tibble: 6 × 3
name word n
<chr> <chr> <int>
1 '30s Reporter burns 1
2 '30s Reporter got 1
3 '30s Reporter kinda 1
4 '30s Reporter mr 1
5 '30s Reporter sensational 1
6 '30s Reporter show 1
Using “afinn”, words will be tagged on a negative to positive scale of -5 to 5.
sc_s <- sc_word_by_character %>%
inner_join(get_sentiments("afinn"),
by = "word")
sc_s
# A tibble: 33,232 × 4
name word n value
<chr> <chr> <int> <dbl>
1 1-Year-Old Bart good 1 3
2 1-Year-Old Bart like 1 2
3 1-Year-Old Bart nice 1 3
4 10-Year-Old Carl love 1 3
5 10-Year-Old Homer best 1 3
6 10-Year-Old Homer chance 1 2
7 10-Year-Old Homer cool 1 1
8 10-Year-Old Homer die 1 -3
9 10-Year-Old Homer died 1 -3
10 10-Year-Old Homer dreams 1 1
# ℹ 33,222 more rows
Compute the mean sentiment for each character.
sc_s %>%
group_by(name) %>%
summarise(m = mean(value, na.rm = TRUE)) %>%
  arrange(desc(m))
# A tibble: 4,197 × 2
name m
<chr> <dbl>
1 4-h Judge 4
2 ALEPPO 4
3 APU+ 4
4 All Kids 4
5 Applicants 4
6 Australian 4
7 Bill James 4
8 Canadian Player 4
9 Carl Kasell 4
10 Chipper Guide 4
# ℹ 4,187 more rows
Focus on the main characters, instead of all characters.
1. Keep characters that have spoken more than 999 lines
keep <- sc %>%
count(name,
sort=TRUE) %>%
filter(!is.na(name)) %>%
  filter(n > 999)
2. Re-compute the sentiment after removing unimportant characters:
sc_s %>%
filter(name %in% keep$name) %>%
group_by(name) %>%
summarise(m = mean(value)) %>%
  arrange(desc(m))
# A tibble: 16 × 2
name m
<chr> <dbl>
1 Waylon Smithers -0.0295
2 Lenny Leonard -0.0491
3 Seymour Skinner -0.106
4 Milhouse Van Houten -0.111
5 Krusty the Clown -0.132
6 Ned Flanders -0.137
7 Moe Szyslak -0.207
8 Apu Nahasapeemapetilon -0.214
9 C. Montgomery Burns -0.223
10 Chief Wiggum -0.274
11 Marge Simpson -0.281
12 Lisa Simpson -0.303
13 Grampa Simpson -0.304
14 Bart Simpson -0.315
15 Nelson Muntz -0.374
16 Homer Simpson -0.378
Section B
1. Bart Simpson is featured at various ages. How has the sentiment of his words changed over his life?
# Hint use string detect - run this example to see how it works
str_detect(c("Bart", "Homer", "30 year old Bart"), "Bart")
[1]  TRUE FALSE  TRUE
sc_s %>%
filter(str_detect(name, "Bart")) %>%
group_by(name) %>%
summarise(m = mean(value)) %>%
  arrange(desc(m))
# A tibble: 31 × 2
name m
<chr> <dbl>
1 80-Year-Old Bart 3
2 Bart Head 3
3 Thought Bubble Bart 3
4 1-Year-Old Bart 2.67
5 2-Year-Old Bart 2
6 Bart Snail 2
7 Shorts Bart 2
8 St. Bartholomew 2
9 Swedish Bartender 2
10 Prince Bart 1.5
# ℹ 21 more rows
80-Year-Old Bart is the most positive.
2. Repeat the sentiment analysis with the “nrc” lexicon. What character is the most “angry”? “joyful”?
nrc <- get_sentiments("nrc")
sc_nrc <- sc_word_by_character %>%
inner_join(nrc, by = "word")
sc_nrc_main <- sc_nrc %>%
  filter(name %in% keep$name)
sc_nrc_main %>%
filter(sentiment == "anger") %>%
  count(name, sort = TRUE)
# A tibble: 16 × 2
name n
<chr> <int>
1 Homer Simpson 507
2 Lisa Simpson 393
3 Marge Simpson 368
4 Bart Simpson 366
5 C. Montgomery Burns 286
6 Seymour Skinner 196
7 Ned Flanders 170
8 Chief Wiggum 167
9 Moe Szyslak 166
10 Grampa Simpson 139
11 Krusty the Clown 139
12 Apu Nahasapeemapetilon 131
13 Milhouse Van Houten 96
14 Lenny Leonard 84
15 Nelson Muntz 84
16 Waylon Smithers 79
sc_nrc_main %>%
filter(sentiment == "joy") %>%
  count(name, sort = TRUE)
# A tibble: 16 × 2
name n
<chr> <int>
1 Homer Simpson 388
2 Marge Simpson 313
3 Lisa Simpson 312
4 Bart Simpson 275
5 C. Montgomery Burns 259
6 Seymour Skinner 202
7 Ned Flanders 168
8 Moe Szyslak 165
9 Krusty the Clown 146
10 Grampa Simpson 140
11 Chief Wiggum 127
12 Apu Nahasapeemapetilon 120
13 Milhouse Van Houten 114
14 Lenny Leonard 107
15 Waylon Smithers 90
16 Nelson Muntz 81
📚 Exercise 9D: Gutenberg (Optional)
Section A - Getting some books to study
The Gutenberg project provides the text of over 57,000 books free online.
Let's explore "On the Origin of Species" by Charles Darwin using the gutenbergr R package.
We need to know the Gutenberg id of each book, which we look up online:
- The first edition is 1228
- The sixth edition is 2009
1. Packages used
We need the tm package to remove numbers from the page, and gutenbergr to access the books.
# The tm package is needed because the book has numbers
# in the text that need to be removed.
# install.packages("tm")
library(tidyverse)
library(tidytext)
library(tm)
library(gutenbergr)
library(broom)
library(plotly)
2. Download darwin
darwin1 <- gutenberg_download(1228, mirror = "http://mirror.csclub.uwaterloo.ca/gutenberg")
darwin1
# A tibble: 16,202 × 2
gutenberg_id text
<int> <chr>
1 1228 "Click on any of the filenumbers below to quickly view each ebo…
2 1228 ""
3 1228 "1228 1859, First Edition"
4 1228 "22764 1860, Second Edition"
5 1228 "2009 1872, Sixth Edition, considered the definitive edition…
6 1228 ""
7 1228 ""
8 1228 ""
9 1228 ""
10 1228 "On"
# ℹ 16,192 more rows
# remove the numbers from the text
darwin1$text <- removeNumbers(darwin1$text)
3. Tokenize
- Break into one word per line
- Remove the stop words
- Count the words
- Find the length of the words
stop_words <- get_stopwords()
darwin1_words <- darwin1 %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
mutate(len = str_length(word))
darwin1_words
# A tibble: 6,941 × 3
word n len
<chr> <int> <int>
1 species 1546 7
2 one 643 3
3 can 517 3
4 may 509 3
5 many 451 4
6 varieties 435 9
7 selection 412 9
8 forms 401 5
9 natural 384 7
10 two 345 3
# ℹ 6,931 more rows
4. Download and tokenize the 6th edition.
darwin6 <- gutenberg_download(2009, mirror = "http://mirror.csclub.uwaterloo.ca/gutenberg")
darwin6$text <- removeNumbers(darwin6$text)
5. Show the tokenized word counts using a histogram.
ggplot(darwin1_words, aes(x = n)) +
  geom_histogram(fill = "midnightblue")
darwin1_words %>%
top_n(n = 20, wt = n) %>%
ggplot(aes(x = n,
y = fct_reorder(word, n))) +
geom_point() +
  ylab("")
darwin6_words <- darwin6 %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
mutate(len = str_length(word))
darwin6_words
# A tibble: 8,956 × 3
word n len
<chr> <int> <int>
1 species 1921 7
2 one 808 3
3 may 663 3
4 many 614 4
5 can 589 3
6 forms 565 5
7 selection 561 9
8 natural 535 7
9 varieties 484 9
10 two 472 3
# ℹ 8,946 more rows
ggplot(darwin6_words, aes(x = n)) +
  geom_histogram(fill = "midnightblue")
darwin6_words %>%
top_n(n = 20,
wt = n) %>%
ggplot(aes(x = n,
y = fct_reorder(word, n))) +
geom_point() +
  ylab("")
6. Compare the word frequency - how often does the same word appear in each edition?
darwin <- full_join( # Full join joins everything together from both tables
darwin1_words,
darwin6_words,
by = "word"
) %>%
rename(
n_ed1 = n.x,
len_ed1 = len.x,
n_ed6 = n.y,
len_ed6 = len.y
  )
7. Plot the word frequency
ggplot(darwin,
aes(x = n_ed1,
y = n_ed6,
label = word)) +
geom_abline(intercept = 0,
slope = 1) +
geom_point(alpha = 0.5) +
xlab("First edition") +
ylab("6th edition") +
scale_x_log10() + # puts everything on a log scale, good to do for count data so that things with large counts don't dominate the scaling of the graph.
scale_y_log10() +
  theme(aspect.ratio = 1)
library(plotly)
ggplotly() # This will let us see the word labels for the points
8. Book comparison. Idea: Find the important words for the content of each document by decreasing the weight of commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents.
Term frequency, inverse document frequency (tf_idf).
Helps measure word importance of a document in a collection of documents.
Recall:
\[ \mathrm{tf\_idf}(w, d, \mathcal{D}) = \mathrm{tf}(w,d) \times \mathrm{idf}(w, \mathcal{D})\] where the term frequency \(\mathrm{tf}(w,d)\) is how often the word occurs as a fraction of all the words in the document, and the inverse document frequency \(\mathrm{idf}(w, \mathcal{D}) = \ln\left( n_{\mathcal{D}} / n_{\mathcal{D}, w} \right)\) is the log of the number of documents in the collection divided by the number of documents containing the word.
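As a sanity check, tf-idf can be computed by hand on a toy two-document corpus (the counts below are made up for this sketch), using the same idf definition as bind_tf_idf():

```r
library(tidyverse)

counts <- tribble(
  ~doc,  ~word,     ~n,
  "ed1", "species", 2,
  "ed1", "amongst", 1,
  "ed6", "species", 3,
  "ed6", "mivart",  1
)

n_docs <- n_distinct(counts$doc)

counts %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%                    # fraction of words in the document
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(doc)),    # log(#docs / #docs containing word)
         tf_idf = tf * idf) %>%
  ungroup()
# "species" appears in both documents, so idf = log(2/2) = 0 and tf_idf = 0;
# "amongst" and "mivart" each appear in only one, so idf = log(2) ≈ 0.693
```

This is why words common to both Darwin editions get tf-idf of zero below, and edition-specific words rise to the top.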
9. Bind the editions:
darwin_all <- bind_rows("first" = darwin1_words,
"sixth" = darwin6_words,
.id = "edition")
darwin_all
# A tibble: 15,897 × 4
edition word n len
<chr> <chr> <int> <int>
1 first species 1546 7
2 first one 643 3
3 first can 517 3
4 first may 509 3
5 first many 451 4
6 first varieties 435 9
7 first selection 412 9
8 first forms 401 5
9 first natural 384 7
10 first two 345 3
# ℹ 15,887 more rows
10. Compute tf-idf
darwin_tf_idf <- darwin_all %>%
bind_tf_idf(word, edition, n)
darwin_tf_idf %>%
  arrange(desc(tf_idf))
# A tibble: 15,897 × 7
edition word n len tf idf tf_idf
<chr> <chr> <int> <int> <dbl> <dbl> <dbl>
1 first amongst 33 7 0.000423 0.693 0.000293
2 sixth among 42 5 0.000399 0.693 0.000277
3 sixth mivart 28 6 0.000266 0.693 0.000184
4 sixth prof 28 4 0.000266 0.693 0.000184
5 sixth cambrian 27 8 0.000257 0.693 0.000178
6 sixth illegitimate 21 12 0.000200 0.693 0.000138
7 sixth lamellæ 21 7 0.000200 0.693 0.000138
8 sixth pedicellariæ 19 12 0.000181 0.693 0.000125
9 sixth dimorphic 18 9 0.000171 0.693 0.000119
10 sixth fittest 17 7 0.000162 0.693 0.000112
# ℹ 15,887 more rows
11. Plot the results for top words
gg_darwin_1_vs_6 <- darwin_tf_idf %>%
arrange(desc(tf_idf)) %>%
group_by(edition) %>%
top_n(15, wt = tf_idf) %>%
ungroup() %>%
ggplot(aes(x = fct_reorder(word, tf_idf),
y = tf_idf,
fill = edition)) +
geom_col(show.legend = FALSE) +
labs(x = NULL,
y = "tf-idf") +
facet_wrap(~edition,
ncol = 2,
scales = "free") +
coord_flip() +
  scale_fill_brewer(palette = "Dark2")
gg_darwin_1_vs_6
- Mr Mivart appears in the 6th edition, multiple times
str_which(darwin6$text, "Mivart")
 [1] 5541 7100 8124 8129 8132 8140 8146 8154 8170 8203 8226 8341
[13] 8351 8376 8381 8415 8561 8614 8654 8680 8723 8729 8752 8762
[25] 8827 8834 8909 9155 9167 9179 9210 9218 9281 16505 20768
darwin6[5541, ]
# A tibble: 1 × 2
gutenberg_id text
<int> <chr>
1 2009 exceptions to this rule, as Mr. Mivart has remarked, that it has…
12. What do we learn?
- The "Prof" title is used more often in the 6th edition
- There is a tendency for Latin names
- Mistletoe was misspelled in the 1st edition
Section B. Worked example - Comparing Darwin
1. Does it look like the 6th edition was an expanded version of the first?
# Look at number of words in each edition
darwin_all %>%
group_by(edition) %>%
  summarise(total = sum(n))
# A tibble: 2 × 2
edition total
<chr> <int>
1 first 77959
2 sixth 105243
2. What word is most frequent in both editions? (hint refer to plots above)
a. Find some words that are not in the first edition but appear in the 6th.
darwin %>%
  filter(is.na(n_ed1))
# A tibble: 2,347 × 5
word n_ed1 len_ed1 n_ed6 len_ed6
<chr> <int> <int> <int> <int>
1 among NA NA 42 5
2 mivart NA NA 28 6
3 prof NA NA 28 4
4 cambrian NA NA 27 8
5 illegitimate NA NA 21 12
6 lamellæ NA NA 21 7
7 pedicellariæ NA NA 19 12
8 dimorphic NA NA 18 9
9 fittest NA NA 17 7
10 orchids NA NA 17 7
# ℹ 2,337 more rows
b. Find some words that are used in the first edition but not in the 6th.
darwin %>%
  filter(is.na(n_ed6))
# A tibble: 332 × 5
word n_ed1 len_ed1 n_ed6 len_ed6
<chr> <int> <int> <int> <int>
1 amongst 33 7 NA NA
2 experimentised 7 14 NA NA
3 weald 7 5 NA NA
4 cowslip 6 7 NA NA
5 primrose 6 8 NA NA
6 unmistakeable 5 13 NA NA
7 brighter 4 8 NA NA
8 downs 4 5 NA NA
9 lat 4 3 NA NA
10 _summary_ 3 9 NA NA
# ℹ 322 more rows
3. Using a linear regression model, find the top few words that appear more often than expected, based on the frequency in the first edition. Find the top few words that appear less often than expected.
darwin <- darwin %>%
mutate(log_n_ed1 = log1p(n_ed1),
log_n_ed6 = log1p(n_ed6))
word_darwin_lm <- lm(log_n_ed6 ~ log_n_ed1, data = darwin)
There are more positive residuals than negative residuals in the residual plot.
darwin_narm <- darwin %>%
filter(!is.na(log_n_ed6) & !is.na(log_n_ed1))
darwin_aug <- augment(word_darwin_lm, darwin_narm)
ggplot(darwin_aug,
aes(x = log_n_ed1,
y = .resid,
label = word)) +
geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, size = 2, colour = "white")
ggplotly()